I. Introduction

Belgium has a decent public transport network, offering accessible and efficient travel across the country. Although geographically small, Belgium is home to popular cities such as Brussels, Liege, Charleroi, Gent, and Antwerpen, all connected by a vast train network. Thanks to this dense railway network, taking trains has become the most efficient way to commute and travel between towns and cities.

However, with the increasing demand for train travel and the unbalanced development across the country, some train lines can be extremely busy with high occupancy levels, while others may be underused. City planners and train conductors are keenly interested in a tool that can help them analyze and estimate train occupancy levels, so that they can develop a better understanding of how to optimize railway operations across the country.

Existing projects and applications have made efforts to address related problems, but mostly from the perspective of customers: they leverage pre-existing and real-time data to help passengers plan a route. This project, together with its companion App, instead targets transportation planners, including city planners and train conductors, and is dedicated to smarter decisions and better service. To that end, this project employs regression models with comprehensive features to predict the occupancy level of each train line, and embeds the algorithm into the App under development to help monitor and analyze occupancy.

Specifically, this project builds a Logistic Regression model to predict high/low occupancy of trains in Belgium on a per-station basis by integrating various data sources. The model is validated across 100 folds as well as spatial patterns. The project offers insights on resource allocation and route optimization for city planners and train conductors.

Belgium Railway Map

Click here to access the BeRail introduction video.

II. Data Loading and Feature Engineering

2.1 Data Loading

This project leverages 6 datasets in total.

  • occupancy: This is the fundamental dataset, provided on Kaggle, identifying the occupancy level of trains in Belgium in 2016. This project works only with the training set.

  • stations: This data table contains information on all train stations. The main hub stations in Belgium are Bruxelles Midi/Brussel Zuid, Brussel Centraal, Brussel Noord, Antwerpen Centraal, Gent Sint-Pieters and Luik Guillemins. At these train stations, it’s possible to connect to trains to Belgium’s main cities and many international destinations.

  • weather: The integrated dataset contains temperature, humidity, windspeed, and visibility data of each station per hour from July 2016 to October 2016.

  • facility: The dataset identifies the amenities of each station. Stations in Belgium usually have excellent facilities, often including: luggage lockers, foreign exchange desks, restaurants and cafés, tourist information offices, ATM cash machines, elevators and escalators, and access for disabled passengers. For simplicity, we sum all facilities of each station when using this variable.

  • rail: A shapefile to show the railway pattern in Belgium.

  • lines: By connecting different stations, this dataset identifies train lines across Belgium.

2.2 Feature Engineering

To enrich the preliminary datasets described above and build a robust model, this project generates a number of features capturing the characteristics of train stations and train operations:

  • Station density: Using nn_function, we create stationsdist to show the distance from each station to the nearest three other stations.

  • Station importance: We select the five biggest cities by population size and create majc_nn5 to show the distance from each station to these major cities.

  • Station occupancy level: We generate a series of features identifying the proportion of each occupancy level at each station. To make them more accurate, we conduct the calculation separately for departure stations and destination stations.

  • Station group occupancy level: In addition to the occupancy level of each station, we generate three more features showing the average proportion of each occupancy level within station groups. Each group consists of the station itself and its three nearest stations.

  • Station popularity: To show how busy a station is, we calculate the number of connections of each station.

  • Occupancy level time lag: Creating time-lag variables adds additional nuance about demand during a given time period.

2.2.1 Station Density

The code below calculates the distance from each station to its nearest three stations; the outcome is shown by occupancy level.

stations_copy <- stations

stations_copy$stationsdist <- nn_function(st_coordinates(stations), st_coordinates(stations), 3)
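nn_function is not a base-R function; it is assumed to come from a utility script loaded earlier. A minimal base-R sketch of what it computes (the mean Euclidean distance from each point to its k nearest points in the search set; when the two sets are identical, as in the call above, the zero self-distance counts among the k nearest) might look like:

```r
# Hypothetical minimal version of the nn_function helper: for each row of
# measurePoints, return the mean Euclidean distance to its k nearest rows
# in searchPoints (no dependencies beyond base R).
nn_function <- function(measurePoints, searchPoints, k) {
  apply(measurePoints, 1, function(p) {
    d <- sqrt((searchPoints[, 1] - p[1])^2 + (searchPoints[, 2] - p[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}
```

For example, with three stations at (0,0), (3,0), and (0,4), the two nearest neighbours of the first station are itself (distance 0) and the second station (distance 3), giving a mean distance of 1.5.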

2.2.2 Distance of Stations to Major Cities

The code below calculates the distance from each station to each of the five major cities by population size: Brussels, Liege, Charleroi, Gent, and Antwerpen.

name <- c("Brussels","Liege","Charleroi", "Gent", "Antwerpen")
lat <- c(50.85045,50.63373, 50.41136, 51.05, 51.21989)
lon <- c(4.34878, 5.56749, 4.44448, 3.71667, 4.40346)

majorcities <- tibble(name, lat, lon) %>%
  st_as_sf(coords = c("lon","lat"), crs = 4326, agr = "constant") %>%
  st_transform(31370)

stations_copy$majc_nn5 <- nn_function(st_coordinates(stations), st_coordinates(majorcities), 5)

stations_copy <-
  stations_copy %>%
  mutate(Brussels = sqrt((st_coordinates(.)[,1] - st_coordinates(majorcities)[1,1])^2 + (st_coordinates(.)[,2] - st_coordinates(majorcities)[1,2])^2)) %>%
  mutate(Liege = sqrt((st_coordinates(.)[,1] - st_coordinates(majorcities)[2,1])^2 + (st_coordinates(.)[,2] - st_coordinates(majorcities)[2,2])^2)) %>%
  mutate(Charleroi = sqrt((st_coordinates(.)[,1] - st_coordinates(majorcities)[3,1])^2 + (st_coordinates(.)[,2] - st_coordinates(majorcities)[3,2])^2)) %>%
  mutate(Gent = sqrt((st_coordinates(.)[,1] - st_coordinates(majorcities)[4,1])^2 + (st_coordinates(.)[,2] - st_coordinates(majorcities)[4,2])^2)) %>%
  mutate(Antwerpen = sqrt((st_coordinates(.)[,1] - st_coordinates(majorcities)[5,1])^2 + (st_coordinates(.)[,2] - st_coordinates(majorcities)[5,2])^2)) 
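The five mutate() calls above repeat the same Euclidean-distance formula in the projected (metric) CRS 31370. A small hypothetical helper makes the repetition explicit:

```r
# Euclidean distance from each point in xy (an n x 2 coordinate matrix)
# to a single reference point city_xy (a length-2 vector). Coordinates
# are assumed to be in a projected CRS, so the result is in meters.
dist_to <- function(xy, city_xy) {
  sqrt((xy[, 1] - city_xy[1])^2 + (xy[, 2] - city_xy[2])^2)
}

# e.g. stations_copy$Brussels <- dist_to(st_coordinates(stations_copy),
#                                        st_coordinates(majorcities)[1, ])
```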

2.2.3 Occupancy Level Proportion of Each Station

The code below calculates the proportion of each occupancy level for each origin station and each destination station respectively.

## Origin Station
data_f_00_h <- data_f_00 %>%
  filter(occupancy=='high')

data_f_00_m <- data_f_00 %>%
  filter(occupancy=='medium')

data_f_00_l <- data_f_00 %>%
  filter(occupancy=='low')


sta_from_total <- as.data.frame(table(data_f_00$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_total=Freq) %>%
  select(3,4)

sta_from_h <- as.data.frame(table(data_f_00_h$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_H_freq=Freq) %>%
  select(3,4)

sta_from_m <- as.data.frame(table(data_f_00_m$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_M_freq=Freq) %>%
  select(3,4) 

sta_from_l <- as.data.frame(table(data_f_00_l$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_L_freq = Freq) %>%
  select(3,4)  


## Destination Station
data_t_00_h <- data_t_00 %>%
  filter(occupancy=='high')

data_t_00_m <- data_t_00 %>%
  filter(occupancy=='medium')

data_t_00_l <- data_t_00 %>%
  filter(occupancy=='low')


sta_to_total <- as.data.frame(table(data_t_00$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_t = as.character(Var1)) %>%
  mutate(T_total = Freq) %>%
  select(3,4)

sta_to_h <- as.data.frame(table(data_t_00_h$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_t = as.character(Var1)) %>%
  mutate(T_H_freq=Freq) %>%
  select(3,4)

sta_to_m <- as.data.frame(table(data_t_00_m$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_t=as.character(Var1)) %>%
  mutate(T_M_freq=Freq) %>%
  select(3,4) 

sta_to_l <- as.data.frame(table(data_t_00_l$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_t=as.character(Var1)) %>%
  mutate(T_L_freq = Freq) %>%
  select(3,4)  


stations_occ <- merge(stations,sta_from_total,by.x="station_name",by.y='station_f',all.x=T)
stations_occ <- merge(stations_occ,sta_from_h,by.x="station_name",by.y='station_f',all.x=T)
stations_occ <- merge(stations_occ,sta_from_m,by.x="station_name",by.y='station_f',all.x=T)
stations_occ <- merge(stations_occ,sta_from_l,by.x="station_name",by.y='station_f',all.x=T)

stations_occ <- merge(stations_occ,sta_to_total,by.x="station_name",by.y='station_t',all.x=T)
stations_occ <- merge(stations_occ,sta_to_h,by.x="station_name",by.y='station_t',all.x=T)
stations_occ <- merge(stations_occ,sta_to_m,by.x="station_name",by.y='station_t',all.x=T)
stations_occ <- merge(stations_occ,sta_to_l,by.x="station_name",by.y='station_t',all.x=T)

stations_occ$F_H_com <- stations_occ$F_H_freq/stations_occ$F_total
stations_occ$F_M_com <- stations_occ$F_M_freq/stations_occ$F_total
stations_occ$F_L_com <- stations_occ$F_L_freq/stations_occ$F_total
stations_occ$T_H_com <- stations_occ$T_H_freq/stations_occ$T_total
stations_occ$T_M_com <- stations_occ$T_M_freq/stations_occ$T_total
stations_occ$T_L_com <- stations_occ$T_L_freq/stations_occ$T_total
stations_occq <- select(stations_occ,1,14,15,16,17,18,19,20)
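The table()/merge() pipeline above can be condensed. On toy data (column names here mirror the occupancy data but the values are illustrative), prop.table() with margin = 1 turns a station-by-level count table directly into the row-wise shares computed above:

```r
# Toy illustration of the per-station occupancy composition: each row of
# `shares` gives the proportion of trips at each occupancy level for one
# origin station, and the rows sum to 1.
toy <- data.frame(from      = c("A", "A", "A", "B", "B"),
                  occupancy = c("high", "high", "low", "low", "medium"))
counts <- table(toy$from, toy$occupancy)
shares <- prop.table(counts, margin = 1)
shares["A", "high"]   # 2 of A's 3 departures are high-occupancy, i.e. 2/3
```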

2.2.4 Average Occupancy Level Proportion by Station Group

The code below calculates the average occupancy level proportion among the three nearest stations of each station.

neighborList <- knn2nb(knearneigh(st_coordinates(stations), 3))
occupancy$from <- as.character(occupancy$from)
occupancy$to <- as.character(occupancy$to)

# For station i, compute the share of trains with occupancy level `occ`
# among its three nearest neighbouring stations. `col` names the station
# column of the occupancy data to match on ("from" or "to").
getnb_from <- function(x, i, col, occ){
  nb_stations <- stations$station_name[x[[i]]]
  nbcount <- occupancy %>%
    filter(.data[[col]] %in% nb_stations) %>%
    tally()
  nbcount1 <- occupancy %>%
    filter(.data[[col]] %in% nb_stations & occupancy == occ) %>%
    tally()
  nbcount2 <- nbcount1 / nbcount
  return(nbcount2[, 1])
}

# Average composition of each occupancy level among the three nearest
# neighbouring stations, computed for both the departure ("from") and
# destination ("to") station columns
nb_share <- function(col, occ){
  unlist(lapply(seq_along(neighborList),
                function(i) getnb_from(neighborList, i, col, occ)))
}

stations_copy$nb_occ_f_low    <- nb_share("from", "low")
stations_copy$nb_occ_f_medium <- nb_share("from", "medium")
stations_copy$nb_occ_f_high   <- nb_share("from", "high")
stations_copy$nb_occ_t_low    <- nb_share("to", "low")
stations_copy$nb_occ_t_medium <- nb_share("to", "medium")
stations_copy$nb_occ_t_high   <- nb_share("to", "high")

2.2.5 Station Popularity

Then we explore the popularity of each station; in other words, we calculate the number of lines connected to each origin station and each destination station respectively.

# Helper: for each station in stt, count the number of lines that stop there
connect <- function(line, stt){
  stt <- as.data.frame(stt) %>% 
    select(1)
  stt$connect <- NA

  for(i in seq_len(nrow(stt))){
    # table() has two levels (TRUE/FALSE) only if the station appears at least once
    if(length(table(line$stops == stt[i, 1])) == 2){
      line_i <- line %>% 
        filter(stops == stt[i, 1])
      stt$connect[i] <- nrow(line_i)
    }
  }
  return(stt)
}


lines_connect <- lines %>% 
  select(5,7) %>% 
  mutate(stops = gsub("\\[","",stopping_station_ids)) %>% 
  mutate(stops = gsub("\\]","",stops)) %>% 
  mutate(stops = gsub("'",  "",stops)) %>% 
  mutate(stops = gsub(" ",  "",stops)) %>% 
  unique() %>% 
  select(3) %>% 
  separate_rows(stops, sep=",") %>%
  group_by(stops) %>%
  summarize(connections = n())

stations_occ_line <- merge(stations_occq, lines_connect, 
                           by.x = "station_name", by.y = "stops", all.x = TRUE)

stations_final <- merge(st_drop_geometry(stations_copy),
                        st_drop_geometry(stations_occ_line),
                        by = "station_name") %>%
  select(-name,-country.code,-avg_stop_times,-facility_sum)

2.2.6 Occupancy Level Time Lag

Since the period of the given data barely covers holidays, this analysis creates lagHour, lag2Hours, lag3Hours, lag4Hours, and lag12Hours to show the occupancy level of each shift before and after a given time, with the holiday effect ruled out.
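The lag construction itself is not shown above; a minimal sketch, assuming the data are sorted by time within each vehicle and rows are hourly, could use a simple shift helper (the helper name and the column assignments are illustrative):

```r
# Shift a vector down by n positions, padding the start with NA, so that
# row i sees the value observed n steps earlier.
lag_n <- function(x, n) c(rep(NA, n), head(x, length(x) - n))

# e.g., per-vehicle lags of the numeric high-occupancy indicator:
# data_f$lagHour   <- lag_n(data_f$high_occ_num, 1)
# data_f$lag2Hours <- lag_n(data_f$high_occ_num, 2)
```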

III. Exploratory Data Analysis

This section is divided into five subsections:

  • Station Characteristics: This subsection creates a series of maps showing station density by occupancy level, each station’s distance to major cities, the high-occupancy proportion of each station, station popularity, and facility characteristics.

  • Weather Correlation: This subsection focuses on weather conditions and their correlations for each OD (origin-destination) pair.

  • Serial Autocorrelation: This subsection consists of three sets of plots, exhibiting train counts by occupancy level overall, on a weekend versus weekday basis, and on a daily basis.

  • Spatial Autocorrelation: This subsection creates an animation showing the 20 busiest lines.

  • Space/Time Correlation: This subsection creates two plots showing occupancy levels across departure stations and destination stations by day of the week.

The findings in this section indicate that occupancy levels might be influenced by temporal and spatial factors, existing station characteristics, and weather conditions.

3.1 Station Characteristics

3.1.1 Station Density by Occupancy Level

The plots below identify station density across occupancy levels. Regardless of occupancy, density decreases from the central part of the country to the periphery, indicating that overall occupancy levels should differ between the center and the periphery.

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_f %>%
            filter(country.code == 'be'), 
          aes(color = stationsdist) ,size = 0.5)+
  labs(title="Station Density by Occupancy Level",
       subtitle = 'Distance of each station to the nearest three stations\n',
       caption = 'Figure 3.1.1')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+
  facet_wrap(~occupancy, ncol = 3) + 
  mapTheme()+
  plotTheme()  +
  theme(legend.position = "bottom")

3.1.2 Distance of Stations to Major Cities

The plots below exhibit a similar pattern: stations cluster in the central part of the country, where the bigger cities are located.

Belgium Big Cities

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_f %>%
            filter(country.code == 'be'), 
          aes(color = majc_nn5) ,size = 0.5)+
  labs(title="Distance of Stations to Major Cities",
       subtitle = ' ',
       caption = 'Figure 3.1.2')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+
  facet_wrap(~occupancy, ncol = 3) + 
  mapTheme()+
  plotTheme()  +
  theme(legend.position = "bottom")

3.1.3 High Occupancy Level Proportion of Each Station

The proportion of high occupancy at each station does not show a clear regularity, although lines to the west appear busier than lines to the east.

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_f %>%
            filter(country.code == 'be'), 
          aes(color = F_H_com) ,size = 0.5)+
  labs(title="High Occupancy Level Proportion of Each Station",
       subtitle = ' ',
       caption = 'Figure 3.1.3')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Ratio')+
  mapTheme()+
  plotTheme()  

3.1.4 Station Popularity

As an important feature, station popularity identifies how many lines depart from, arrive at, or pass through a given station. The plot below shows that the primary station hubs coincide with the big cities, which points to potentially high occupancy levels there.

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_f %>%
            filter(country.code == 'be'), 
          aes(color = connections) ,size = 0.5)+
  labs(title="Station Popularity Across Nation",
       subtitle = 'Counts of connections of each station\n',
       caption = 'Figure 3.1.4')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+
  mapTheme()+
  plotTheme() 

3.1.5 Facility Characteristics

Stations are fairly balanced in terms of facilities. Therefore, facilities may not be a vital factor affecting passenger behavior.

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_f %>%
            filter(country.code == 'be'), 
          aes(color = facility_sum) ,size = 0.5)+
  labs(title="Facility Counts Across Stations",
       subtitle = 'Departure Station\n',
       caption = 'Figure 3.1.5')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+
  mapTheme()+
  plotTheme() 

3.2 Weather Correlation

As for the weather features (humidity, wind speed, visibility, and temperature): regardless of day and occupancy level, most train shifts occur when visibility is around 9-10, temperature is around 10-20, and humidity is around 60-100, while wind speed fluctuates throughout.

grid.arrange(
  ggplot(data_f, aes(interval60,humidity)) + geom_line(color = "#6897BB") +
    labs(title="Weather characteristics: Humidity",
         x="Hour", y="Humidity") + 
    plotTheme(),
  ggplot(data_f, aes(interval60,windspeed)) + geom_line(color = "#6897BB") +
    labs(title="Weather characteristics: Wind Speed", 
         x="Hour", y="Wind Speed") + 
    plotTheme(),
  ggplot(data_f, aes(interval60,temperature)) + geom_line(color = "#6897BB") +
    labs(title="Weather characteristics: Temperature",
         x="Hour", y="temperature",
         caption="Figure 3.2.1") + 
    plotTheme(),
  ggplot(data_f, aes(interval60,visibility)) + geom_line(color = "#6897BB") +
    labs(title="Weather characteristics: Visibility",
         x="Hour", y="Visibility") + 
    plotTheme())

ggplot(data_f %>% 
         mutate(hour = hour(datetime)) %>% 
         filter(occupancy == "high" | occupancy == "medium" | occupancy == "low"), 
       aes(temperature, color = day))+
  geom_freqpoly(binwidth = 1)+
  facet_wrap(~occupancy,ncol=3) +
  scale_colour_manual(values = palette7,
                      name = "Week Day") +
  labs(title="Train Counts Across Temperature by Occupancy Level",
       subtitle = "Belgium, 2016\n",
       caption = "Figure 3.2.2 ",
       x="Temperature", 
       y="Counts") +
  plotTheme()

ggplot(data_f %>% 
         mutate(hour = hour(datetime)) %>% 
         filter(occupancy == "high" | occupancy == "medium" | occupancy == "low"), 
       aes(humidity, color = day))+
  geom_freqpoly(binwidth = 1)+
  facet_wrap(~occupancy,ncol=3) +
  scale_colour_manual(values = palette7,
                      name = "Week Day") +
  labs(title="Train Counts Across Humidity by Occupancy Level",
       subtitle = "Belgium, 2016\n",
       caption = "Figure 3.2.3",
       x="Humidity", 
       y="Counts") +
  plotTheme()

ggplot(data_f %>% 
         mutate(hour = hour(datetime)) %>% 
         filter(occupancy == "high" | occupancy == "medium" | occupancy == "low"), 
       aes(windspeed, color = day))+
  geom_freqpoly(binwidth = 1)+
  facet_wrap(~occupancy,ncol=3) +
  scale_colour_manual(values = palette7,
                      name = "Week Day") +
  labs(title="Train Counts Across Windspeed by Occupancy Level",
       subtitle = "Belgium, 2016\n",
       caption = "Figure 3.2.4",
       x="Windspeed", 
       y="Counts") +
  plotTheme()

ggplot(data_f %>% 
         mutate(hour = hour(datetime)) %>% 
         filter(occupancy == "high" | occupancy == "medium" | occupancy == "low"), 
       aes(visibility, color = day))+
  geom_freqpoly(binwidth = 1)+
  facet_wrap(~occupancy,ncol=3) +
  scale_colour_manual(values = palette7,
                      name = "Week Day") +
  labs(title="Train Counts Across Visibility by Occupancy Level",
       subtitle = "Belgium, 2016\n",
       caption = "Figure 3.2.5",
       x="Visibility", 
       y="Counts") +
  plotTheme()

3.3 Serial Autocorrelation

Station occupancy levels exhibit a strong serial pattern. From 6:00 to 8:00 and from 16:00 to 17:00, more stations experience medium and high occupancy levels, and passenger volume drops sharply in other time periods. For stations with medium and high occupancy levels, no distinct difference appears between weekdays and weekends. Regardless of weekday or weekend, trains are generally not busy between 10:00 and 15:00.

ggplot(data_f %>% 
         mutate(hour = hour(datetime)))+
  geom_freqpoly(aes(hour, color = occupancy), binwidth = 1)+
  scale_colour_manual(values = palette3) +
  labs(title="Train Counts by Occupancy Level (General)",
       subtitle = 'Belgium, 2016\n',
       caption = "Figure 3.3.1",
       
       x="Hour of the Day", 
       y="Counts")+
  plotTheme()

ggplot(data_f %>% mutate(hour = hour(datetime),
                             weekend = ifelse(day %in% c("Sun", "Sat"), "Weekend", "Weekday"))) +
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  scale_color_manual(values = palette2,
                     name = 'Period') +
  labs(title="Train Counts by Occupancy Level (Week Periods)",
       subtitle = 'Belgium, 2016\n',
       caption = 'Figure 3.3.2',
       x="Hour of the Day", 
       y="Counts")+
  facet_wrap(~occupancy,ncol=3) +
  plotTheme()

ggplot(data_f %>% mutate(hour = hour(datetime))) +
  geom_freqpoly(aes(hour, color = day), binwidth = 1)+
  scale_color_manual(values = palette7,
                     name = 'Week Day') +
  labs(title="Train Counts by Occupancy Level (Week Days)",
       subtitle = 'Belgium, 2016\n',
       caption = 'Figure 3.3.3',
       x="Hour of the Day", 
       y="Counts")+
  facet_wrap(~occupancy,ncol=3) +
  plotTheme()

3.4 Spatial Autocorrelation

Station workloads are unbalanced across the nation: occupancy levels vary widely across stations, with some stations connecting more routes while others are less popular. The animation on the right shows the top 20 busiest lines across the nation, and it is easy to see that Brussels is an undisputed transportation hub from which at least 150 different lines depart.

test04 <- data %>%
  group_by(vehicle) %>%
  summarise(n = n()) %>%
  filter(n >= 12)
freq_v = c("IC1518","IC429","IC1515","IC407","P7305","IC1807","IC3631","1828","8015","IC1831","IC539","P7444","S83978", "IC1507","IC716","L557","S23665","IC3432","IC4317","S53586")

p1 <- ggplot() +
  geom_sf(data = rail, color = "grey") + 
  geom_sf(data=subset(data_f, data_f$vehicle %in% freq_v),aes(colour = vehicle),size=2,show.legend = "point")+ 
  labs(title="Top 20 Busiest Lines",
       subtitle = 'Departure Stations\n',
       caption = "Figure 3.4")+
  mapTheme()+
  plotTheme() +
  theme(legend.position = "bottom") + 
  transition_manual(factor(vehicle, levels = freq_v), cumulative = TRUE) 
 


## occupancy by lines: destination
p2 <- ggplot()+
  geom_sf(data = rail, color = "grey") +
  geom_sf(data=subset(data_t, data_t$vehicle %in% freq_v),aes(colour = vehicle),size=2,show.legend = "point")+ 
  labs(title="Top 20 Frequent Lines Plotted to Destination\n",
       caption = 'Figure')+
  mapTheme()+
  plotTheme() +
  theme(legend.position = "bottom") +  
  transition_manual(factor(vehicle, levels = freq_v), cumulative = TRUE)

gganimate::animate(p1,  duration=10,renderer = gifski_renderer())

3.5 Space/Time Correlation

Occupancy levels are influenced by both temporal and spatial factors. More trains depart from Tuesday to Saturday than on other days of the week. On Mondays and Tuesdays, stations with medium and high occupancy levels spread from east to west, while on Tuesdays, Wednesdays, Thursdays, and Saturdays, stations in the southern part of the nation carry more passengers.

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_f %>%
            filter(country.code == 'be'), 
          aes(color = occupancy) ,size = 0.5)+
  labs(title="Occupancy Level Across Departure Stations by Week Days\n",
       caption = 'Figure 3.5.1')+ 
  facet_wrap(~day, nrow =2) +
  scale_colour_manual(values = palette3,
                      name = 'Occupancy') +
  mapTheme()+
  plotTheme() 

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = data_t %>%
            filter(country.code == 'be'), 
          aes(color = occupancy) ,size = 0.5)+
  labs(title="Occupancy Level Across Destination Stations by Week Days\n",
       caption = 'Figure 3.5.2')+ 
  facet_wrap(~day, nrow =2) +
  scale_colour_manual(values = palette3,
                      name = 'Occupancy') +
  mapTheme()+
  plotTheme() 

IV. Model Building and Prediction

4.1 Model Building

Since we are mainly concerned with whether the occupancy level is high or not, from the model-building stage onward this project transforms the dependent variable into high_occ, a binomial variable with two levels, Yes and No, identifying whether the occupancy level is high. We therefore build a Logistic Regression model. After screening the features built above, this project fits two models: one without time-lag features and one with them. Summary statistics for the two models are shown below.

# Create regression dataset
data_f <- data_f %>% 
  rowid_to_column(var = "rowIndex") 
  
data_f_reg <- data_f %>% 
  mutate(hour = hour(interval60),
         min = hour(interval15)) %>%
  dplyr::select(-name,-date,-time,-connection,-from.X,-from.Y,-to.X,-to.Y,-datetime,-X,-lat,-lng,-date_time,-interval60,-interval15) %>% 
  filter(country.code == 'be') %>% 
  st_drop_geometry() %>%
  na.omit()

data_f_reg$high_occ <- ifelse(data_f_reg$occupancy == "high", "Yes","No")
data_f_reg$high_occ_num <- ifelse(data_f_reg$occupancy == "high", 1,0)
# table(data_f_reg$high_occ)

set.seed(3456)
trainIndex <- createDataPartition(data_f_reg$occupancy, p = .65,
                                  list = FALSE,
                                  times = 1)
data_f_regTrain <- data_f_reg[ trainIndex,]
data_f_regTest  <- data_f_reg[-trainIndex,]


# m0 <- multinom(occupancy ~ from + to + avg_stop_times + facility_sum + temperature + humidity + windspeed + visibility + F_H_com + F_M_com + F_L_com + T_H_com + T_M_com + T_L_com + connections + week + day + hour + min + time_of_day + vehicle_type , data = na.omit(data_f_reg),MaxNWts =10000000)

# m1 <- multinom(occupancy ~ from + to + avg_stop_times + facility_sum + temperature + humidity + windspeed + visibility + F_H_com + F_M_com + F_L_com + T_H_com + T_M_com + T_L_com + connections + week + day + hour + min + vehicle_type + lagHour + lag2Hours + lag3Hours + lag4Hours + lag12Hours + time_of_day,  data = na.omit(data_f_reg),MaxNWts =10000000)

m0 <- glm(high_occ_num ~ from + to + avg_stop_times + facility_sum + temperature + humidity + windspeed + visibility + F_H_com + F_M_com + F_L_com + T_H_com + T_M_com + T_L_com + connections + week + day + hour + min + vehicle_type + time_of_day + stationsdist + distance,
          data = data_f_reg,
          family = binomial(link = "logit"))

m1 <- glm(high_occ_num ~ from + to + avg_stop_times + facility_sum + temperature + humidity + windspeed + visibility + F_H_com + F_M_com + F_L_com + T_H_com + T_M_com + T_L_com + connections + week + day + hour + min + vehicle_type + lagHour + lag2Hours + lag3Hours + lag4Hours + lag12Hours + time_of_day + stationsdist + distance,
          data = data_f_reg,
          family = binomial(link = "logit"))
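To illustrate how a binomial glm like the ones above turns predicted probabilities into high/low labels, here is a self-contained toy example (base R only; the data, coefficients, and 0.5 cutoff are illustrative, not the project's actual results):

```r
# Fit a toy logistic regression, predict probabilities, and threshold at 0.5
set.seed(1)
toy   <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(2 * toy$x))          # outcome from a true logit model
fit   <- glm(y ~ x, data = toy, family = binomial(link = "logit"))
probs <- predict(fit, toy, type = "response")       # P(y = 1) for each row
pred  <- ifelse(probs > 0.5, 1, 0)                  # binary Yes/No-style label
table(observed = toy$y, predicted = pred)           # confusion matrix
```

The same predict()/threshold pattern applies to m0 and m1 on a held-out test set such as data_f_regTest.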



stargazer(m0, m1, type = "html", 
          title = "Table 1. Summary Statistics of Model 1 and Model 2",
          header = FALSE,
          single.row = TRUE,
          column.labels=c("Model 1","Model 2"))
Table 1. Summary Statistics of Model 1 and Model 2
(Dependent variable: high_occ_num. The rendered stargazer table reports a coefficient and standard error for every departure-station (from) and destination-station (to) fixed effect under both models; the full table is omitted here.)
to008821865 0.353 (9,224.405) -0.165 (9,224.403)
to008821907 37.614 (9,224.405) 37.563 (9,224.403)
to008821964 -0.187 (6,996.782) -0.249 (7,003.054)
to008822004 18.417 (6,522.640) 18.320 (6,522.638)
to008822053 1.139 (9,224.405) 0.694 (9,224.403)
to008822111 19.123 (6,522.640) 18.883 (6,522.638)
to008822145 1.009 (9,224.405) 0.879 (9,224.403)
to008822160 36.457 (6,764.862) 36.386 (6,769.547)
to008822251 0.871 (7,988.570) 1.020 (7,988.559)
to008822343 1.148 (7,257.008) 1.199 (7,274.557)
to008831005 19.010 (6,522.640) 18.871 (6,522.638)
to008831039 18.635 (6,522.640) 18.519 (6,522.638)
to008831088 1.057 (9,224.405) 0.465 (9,224.403)
to008831112 38.371 (9,224.405) 38.431 (9,224.403)
to008831138 0.004 (9,224.405) -0.322 (9,224.403)
to008831310 36.993 (9,224.405) 36.759 (9,224.403)
to008831401 17.962 (6,522.640) 17.936 (6,522.638)
to008831765 1.064 (6,692.059) 1.041 (6,687.666)
to008831781 0.046 (9,224.405) 0.183 (9,224.403)
to008831807 0.023 (7,872.116) -0.052 (7,893.733)
to008832250 2.172 (7,468.877) 2.214 (7,471.611)
to008832375 1.032 (9,224.405) 0.579 (9,224.403)
to008832409 -0.073 (7,849.204) 0.036 (7,880.343)
to008832433 38.148 (7,518.114) 38.238 (7,515.779)
to008832458 18.265 (6,522.640) 17.935 (6,522.638)
to008833001 18.631 (6,522.640) 18.526 (6,522.638)
to008833126 1.804 (7,988.570) 1.764 (7,986.729)
to008833134 36.998 (9,224.405) 36.587 (9,224.403)
to008833175 1.816 (9,224.405) 1.757 (9,224.403)
to008833209 19.262 (6,522.640) 19.188 (6,522.638)
to008833233 17.899 (6,522.640) 17.737 (6,522.638)
to008833266 0.923 (9,224.405) 0.760 (9,224.403)
to008833274 17.964 (6,522.640) 17.992 (6,522.638)
to008833308 15.863 (6,522.640) 15.695 (6,522.638)
to008833605 19.246 (6,522.640) 19.099 (6,522.638)
to008841004 18.378 (6,522.640) 18.345 (6,522.638)
to008841202 37.466 (6,937.949) 37.391 (6,931.687)
to008841400 17.350 (6,522.640) 17.383 (6,522.638)
to008841525 19.698 (6,522.640) 19.737 (6,522.638)
to008841558 18.932 (6,522.640) 18.857 (6,522.638)
to008843208 18.510 (6,522.640) 18.219 (6,522.638)
to008843307 0.703 (9,224.405) 0.454 (9,224.403)
to008844008 1.390 (9,224.405) 1.540 (9,224.403)
to008844206 -1.489 (7,417.830) -1.214 (7,448.228)
to008861200 18.830 (6,522.640) 18.716 (6,522.638)
to008861523 1.617 (9,224.405) 1.751 (9,224.403)
to008863008 18.978 (6,522.640) 18.908 (6,522.638)
to008863404 0.547 (9,224.405) 0.474 (9,224.403)
to008864345 0.341 (9,224.405) 0.057 (9,224.403)
to008864451 1.916 (9,224.405) 1.949 (9,224.403)
to008864469 39.265 (9,224.405) 39.261 (9,224.403)
to008864501 19.648 (6,522.640) 19.501 (6,522.638)
to008866001 0.718 (9,224.405) 0.812 (9,224.403)
to008871308 1.243 (9,224.405) 1.404 (9,224.403)
to008872009 17.446 (6,522.640) 17.401 (6,522.638)
to008873122 37.657 (9,224.405) 37.780 (9,224.403)
to008873320 0.188 (9,224.405) 0.394 (9,224.403)
to008874005 -0.346 (7,904.006) -0.434 (7,871.563)
to008874609 1.025 (9,224.405) 1.028 (9,224.403)
to008874724 -0.105 (9,224.405) -0.539 (9,224.403)
to008881000 19.244 (6,522.640) 19.096 (6,522.638)
to008881166 18.144 (6,522.640) 18.033 (6,522.638)
to008882107 0.683 (9,224.405) 0.706 (9,224.403)
to008882206 1.376 (7,983.676) 1.426 (7,986.056)
to008883006 19.125 (6,522.640) 19.001 (6,522.638)
to008883113 -1.243 (9,224.405) -1.440 (9,224.403)
to008883212 19.256 (6,522.640) 18.900 (6,522.638)
to008883311 18.945 (6,522.640) 18.766 (6,522.638)
to008883436 20.023 (6,522.640) 19.780 (6,522.638)
to008883808 -0.532 (7,003.532) -0.664 (7,023.284)
to008884004 37.223 (9,224.405) 37.322 (9,224.404)
to008884541 36.467 (9,224.405) 36.421 (9,224.403)
to008884566 -0.547 (9,224.405) -1.012 (9,224.404)
to008884715 0.00000 (9,224.405) -0.194 (9,224.404)
to008884855 0.344 (9,224.405) -0.067 (9,224.404)
to008885001 -17.016 (8,134.162) -16.801 (8,177.016)
to008885704 -0.160 (7,515.026) -0.333 (7,507.061)
to008885753 -0.903 (9,224.405) -1.168 (9,224.404)
to008886009 1.736 (7,368.009) 1.730 (7,395.100)
to008886348 19.377 (6,522.641) 18.970 (6,522.638)
to008891009 17.410 (6,522.640) 17.322 (6,522.638)
to008891033 2.778 (9,224.405) 2.320 (9,224.403)
to008891116 1.126 (9,224.405) 0.987 (9,224.403)
to008891132 3.253 (7,945.355) 3.218 (7,949.042)
to008891140 17.186 (6,522.640) 16.975 (6,522.638)
to008891314 0.424 (9,224.405) 0.606 (9,224.403)
to008891405 37.443 (9,224.405) 37.135 (9,224.403)
to008891660 1.024 (7,984.392) 0.820 (7,986.949)
to008891702 17.957 (6,522.640) 17.887 (6,522.638)
to008892007 17.987 (6,522.640) 17.872 (6,522.638)
to008892080 19.499 (6,522.640) 19.394 (6,522.638)
to008892106 16.268 (6,522.640) 16.145 (6,522.638)
to008892205 0.528 (7,336.901) 0.359 (7,315.303)
to008892254 18.152 (6,522.640) 18.166 (6,522.638)
to008892403 18.133 (6,522.641) 18.213 (6,522.638)
to008892452 -0.444 (9,224.405) -0.456 (9,224.403)
to008892601 37.023 (7,433.402) 36.970 (7,403.665)
to008892635 38.295 (9,224.405) 37.759 (9,224.403)
to008892650 19.199 (6,522.640) 19.086 (6,522.638)
to008893013 17.328 (6,522.640) 17.331 (6,522.638)
to008893039 -1.258 (9,224.405) -1.075 (9,224.403)
to008893120 18.238 (6,522.640) 18.054 (6,522.638)
to008893401 16.263 (6,522.640) 16.006 (6,522.638)
to008893526 17.651 (6,522.641) 17.276 (6,522.638)
to008893534 1.048 (7,988.570) 0.990 (7,985.799)
to008893542 0.587 (9,224.405) 0.461 (9,224.403)
to008893559 18.699 (6,522.640) 18.761 (6,522.638)
to008893567 18.851 (6,522.640) 18.687 (6,522.638)
to008893708 1.331 (7,768.432) 1.379 (7,883.269)
to008894201 16.768 (6,522.640) 16.681 (6,522.638)
to008894425 1.659 (7,273.029) 1.790 (7,275.658)
to008894508 18.929 (6,522.640) 18.826 (6,522.638)
to008894748 18.425 (6,522.640) 18.267 (6,522.638)
to008895000 17.240 (6,522.640) 17.209 (6,522.638)
to008895091 17.988 (6,522.640) 18.037 (6,522.638)
to008895208 17.534 (6,522.640) 17.419 (6,522.638)
to008895646 17.602 (6,522.640) 17.642 (6,522.638)
to008895760 18.011 (6,522.640) 17.991 (6,522.638)
to008895802 17.961 (6,522.640) 17.899 (6,522.638)
to008895836 17.198 (6,522.640) 17.126 (6,522.638)
to008896008 19.027 (6,522.640) 18.890 (6,522.638)
to008896115 -0.318 (7,291.233) -0.184 (7,288.598)
to008896149 17.560 (6,522.640) 17.420 (6,522.638)
to008896370 0.866 (9,224.405) 1.060 (9,224.403)
to008896388 36.686 (9,224.405) 36.469 (9,224.403)
to008896412 37.178 (9,224.405) 37.238 (9,224.403)
to008896735 18.703 (6,522.640) 18.475 (6,522.638)
to008896925 19.002 (6,522.640) 18.755 (6,522.638)
avg_stop_times
facility_sum
temperature 0.085*** (0.023) 0.080*** (0.023)
humidity -0.012 (0.008) -0.012 (0.008)
windspeed -0.008 (0.012) -0.009 (0.012)
visibility -0.102** (0.041) -0.103** (0.041)
F_H_com
F_M_com
F_L_com
T_H_com
T_M_com
T_L_com
connections
week 0.152*** (0.033) 0.143*** (0.033)
dayMon 0.195 (0.269) 0.225 (0.271)
daySat 0.172 (0.215) 0.162 (0.217)
daySun -0.190 (0.309) -0.118 (0.310)
dayThu -0.528** (0.244) -0.502** (0.245)
dayTue -0.162 (0.235) -0.151 (0.237)
dayWed -0.278 (0.228) -0.301 (0.230)
hour -0.018 (0.021) -0.016 (0.021)
min
vehicle_typeICE -17.923 (4,583.271) -17.952 (4,548.114)
vehicle_typeICT 1.113 (1.458) 1.163 (1.414)
vehicle_typeL -0.992** (0.432) -0.989** (0.434)
vehicle_typeP -0.129 (0.251) -0.142 (0.252)
vehicle_typeS -0.115 (0.363) -0.165 (0.366)
vehicle_typeTGV
vehicle_typeTHA 0.046 (1.249) 0.153 (1.231)
vehicle_typeTRN 37.258 (9,224.404) 37.102 (9,224.404)
vehicle_typeundefined -0.678 (0.458) -0.685 (0.460)
lagHour 0.200*** (0.077)
lag2Hours 0.075 (0.076)
lag3Hours -0.070 (0.078)
lag4Hours -0.023 (0.077)
lag12Hours -0.037 (0.077)
time_of_dayMid-Day -0.797*** (0.253) -0.752*** (0.254)
time_of_dayOvernight 0.407 (0.278) 0.374 (0.278)
time_of_dayPM Rush -0.008 (0.266) 0.012 (0.266)
stationsdist
distance -0.00000 (0.00000) -0.00000 (0.00000)
Constant -23.758 (6,522.641) -23.578 (6,522.638)
Observations 1,722 1,722
Log Likelihood -836.410 -831.933
Akaike Inf. Crit. 2,200.820 2,201.865
Note: *p<0.1; **p<0.05; ***p<0.01
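Because these are logit coefficients, each estimate is a change in log-odds; exponentiating converts it to an odds ratio. As a quick worked example (using the Model 1 temperature estimate of 0.085 from the table above):

```r
# A logit coefficient is a change in log-odds; exp(coef) converts it to
# an odds ratio. Temperature's Model 1 estimate is 0.085:
temp_coef <- 0.085
odds_ratio <- exp(temp_coef)
round(odds_ratio, 3)
# 1.089 -> each extra degree raises the odds of high occupancy by ~8.9%
```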

4.2 Occupancy Prediction

By leveraging the two models, we predict the occupancy level and generate two plots showing the distribution of predicted outcomes for each model. Since the two distributions show no marked differences, we cannot yet conclude which model is better. However, because we are more interested in the time-lag features, we proceed with Model 2 from here on.

testProbs0 <- data.frame(Outcome0 = as.factor(data_f_reg$high_occ_num),
                        Probs0 = predict(m0, newdata = data_f_reg, type = "response"))


testProbs1 <- data.frame(Outcome1 = as.factor(data_f_reg$high_occ_num),
                        Probs1 = predict(m1, newdata = data_f_reg, type = "response"))

# View(testProbs1)

grid.arrange(
  ggplot(testProbs0, aes(x = Probs0, fill = as.factor(Outcome0))) + 
  geom_density() +
  facet_grid(Outcome0 ~ .) +
  scale_fill_manual(values = palette2,
                          name = "Predicted High Occupancy",
                    labels = c("Low/Medium", "High")) +
  labs(x = "Occupancy", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome - Model 1\n") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "bottom") +
  plotTheme(),

ggplot(testProbs1, aes(x = Probs1, fill = as.factor(Outcome1))) + 
  geom_density() +
  facet_grid(Outcome1 ~ .) +
  scale_fill_manual(values = palette2,
                          name = "Predicted High Occupancy",
                    labels = c("Low/Medium", "High")) +
  labs(x = "Occupancy", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome - Model 2\n",
       caption = 'Figure 4') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "bottom") +
  plotTheme())

V. Model Evaluation and Validation

5.1 Goodness of Fit: Confusion Matrix

A “confusion matrix” at the 50% threshold reports the counts of true positives, false positives, true negatives, and false negatives at that threshold. From these counts we derive the true positive rate (sensitivity) and the true negative rate (specificity).

In this case:

  • True Positives: predicted correctly that the occupancy of the train is high.
  • False Positives: predicted incorrectly that the occupancy of the train is high.
  • True Negatives: predicted correctly that the occupancy of the train is not high.
  • False Negatives: predicted incorrectly that the occupancy of the train is not high.

This analysis cares more about sensitivity, since the goal is to identify stations with a high occupancy level. At the 50% threshold, the model's sensitivity is 0.47.
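As a sanity check, the headline metrics can be recomputed by hand from the confusion-matrix counts reported below (TN = 1041, FP = 127, FN = 296, TP = 258):

```r
# Recompute the headline metrics from the confusion-matrix counts:
TN <- 1041; FP <- 127; FN <- 296; TP <- 258
sensitivity <- TP / (TP + FN)            # true positive rate
specificity <- TN / (TN + FP)            # true negative rate
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
round(c(sensitivity, specificity, accuracy), 4)
# 0.4657 0.8913 0.7544 -- matching the caret output below
```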

testProbs1 <- 
  testProbs1 %>%
  mutate(predOutcome1  = as.factor(ifelse(testProbs1$Probs1 > 0.5 , 1, 0)))


caret::confusionMatrix(testProbs1$predOutcome1, testProbs1$Outcome1, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1041  296
##          1  127  258
##                                           
##                Accuracy : 0.7544          
##                  95% CI : (0.7333, 0.7745)
##     No Information Rate : 0.6783          
##     P-Value [Acc > NIR] : 2.744e-12       
##                                           
##                   Kappa : 0.3881          
##                                           
##  Mcnemar's Test P-Value : 3.124e-16       
##                                           
##             Sensitivity : 0.4657          
##             Specificity : 0.8913          
##          Pos Pred Value : 0.6701          
##          Neg Pred Value : 0.7786          
##              Prevalence : 0.3217          
##          Detection Rate : 0.1498          
##    Detection Prevalence : 0.2236          
##       Balanced Accuracy : 0.6785          
##                                           
##        'Positive' Class : 1               
## 
conf.matrix.table1 <-
   testProbs1 %>%
      count(predOutcome1, Outcome1) %>%
      summarize(True_Negative = sum(n[predOutcome1==0 & Outcome1==0]),
                True_Positive = sum(n[predOutcome1==1 & Outcome1==1]),
                False_Negative = sum(n[predOutcome1==0 & Outcome1==1]),
                False_Positive = sum(n[predOutcome1==1 & Outcome1==0])) %>%
       gather(Variable, Count) %>%
    bind_cols(data.frame(Description = c(
              "Predicted correctly the occupancy of train is not high",
              "Predicted correctly the occupancy of train is high",
              "Predicted incorrectly the occupancy of train is not high",
              "Predicted incorrectly the occupancy of train is high")))


kable(conf.matrix.table1) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 2. Confusion Matrix Table\n",
           general_title= '\n')
Variable Count Description
True_Negative 1041 Predicted correctly the occupancy of train is not high
True_Positive 258 Predicted correctly the occupancy of train is high
False_Negative 296 Predicted incorrectly the occupancy of train is not high
False_Positive 127 Predicted incorrectly the occupancy of train is high

Table 2. Confusion Matrix Table

5.2 Cross-Validation Across 100 Folds

This section conducts 100-fold cross-validation on the model. The outputs include the ROC, sensitivity, and specificity across the 100 folds. The model generalizes well with regard to sensitivity; since sensitivity is our main concern, as noted above, this outcome is satisfactory.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction = twoClassSummary)


cvFit.reg1 <- train(high_occ ~ from + to + avg_stop_times + facility_sum + temperature + humidity + windspeed + visibility + F_H_com + F_M_com + F_L_com + T_H_com + T_M_com + T_L_com + connections + week + day + hour + min + vehicle_type + lagHour + lag2Hours + lag3Hours + lag4Hours + lag12Hours + time_of_day + distance + stationsdist
                      , data = data_f_reg,
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)
dplyr::select(cvFit.reg1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit.reg1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#3e7ec2") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", 
         title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean represented as dotted lines\n",
         caption = "Figure 5.2") +
  plotTheme()

kable(cvFit.reg1$resample,
          caption = 'Table 3. RESULT. Cross-validation Test: Summary of ROC, Sensitivity, and Specificity') %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "200px")
Table 3. RESULT. Cross-validation Test: Summary of ROC, Sensitivity, and Specificity
ROC Sens Spec Resample
0.5272727 0.8181818 0.2000000 Fold001
0.8030303 0.9090909 0.1666667 Fold002
0.6666667 0.5833333 0.6666667 Fold003
0.6500000 1.0000000 0.2000000 Fold004
0.7000000 0.8333333 0.4000000 Fold005
0.6944444 0.9166667 0.5000000 Fold006
0.5555556 0.8333333 0.1666667 Fold007
0.5333333 0.6666667 0.4000000 Fold008
0.4722222 0.9166667 0.3333333 Fold009
0.4500000 0.9166667 0.2000000 Fold010
0.5000000 0.6363636 0.3333333 Fold011
0.6527778 0.5833333 0.3333333 Fold012
0.5000000 0.9166667 0.1666667 Fold013
0.4666667 0.6666667 0.0000000 Fold014
0.5833333 1.0000000 0.4000000 Fold015
0.6363636 0.6363636 0.4000000 Fold016
0.3666667 0.9166667 0.2000000 Fold017
0.6909091 0.9090909 0.4000000 Fold018
0.7500000 0.7500000 0.6000000 Fold019
0.7500000 0.8333333 0.2000000 Fold020
0.7222222 0.7500000 0.5000000 Fold021
0.7638889 0.9166667 0.3333333 Fold022
0.6181818 0.7272727 0.6000000 Fold023
0.6545455 0.9090909 0.4000000 Fold024
0.6666667 0.9090909 0.3333333 Fold025
0.4583333 0.9166667 0.1666667 Fold026
0.4848485 0.8181818 0.0000000 Fold027
0.5833333 0.7500000 0.4000000 Fold028
0.6000000 0.8333333 0.4000000 Fold029
0.5833333 0.7500000 0.1666667 Fold030
0.3636364 0.5454545 0.2000000 Fold031
0.7666667 0.8333333 0.4000000 Fold032
1.0000000 0.8181818 1.0000000 Fold033
0.7638889 0.8333333 0.5000000 Fold034
0.6333333 0.8333333 0.6000000 Fold035
0.5909091 0.8181818 0.1666667 Fold036
0.6583333 1.0000000 0.4000000 Fold037
0.5303030 0.6363636 0.3333333 Fold038
0.5757576 0.9090909 0.1666667 Fold039
0.5555556 0.7500000 0.1666667 Fold040
0.6111111 0.7500000 0.5000000 Fold041
0.6363636 1.0000000 0.0000000 Fold042
0.6000000 0.7500000 0.4000000 Fold043
0.6666667 0.8333333 0.3333333 Fold044
0.7500000 0.9166667 0.2000000 Fold045
0.7916667 0.9166667 0.5000000 Fold046
0.3833333 0.8333333 0.2000000 Fold047
0.6666667 1.0000000 0.3333333 Fold048
0.5277778 0.6666667 0.1666667 Fold049
0.6250000 0.8333333 0.3333333 Fold050
0.8472222 0.9166667 0.5000000 Fold051
0.6212121 0.8181818 0.3333333 Fold052
0.7666667 0.8333333 0.4000000 Fold053
0.8472222 0.9166667 0.3333333 Fold054
0.3636364 0.9090909 0.1666667 Fold055
0.6727273 0.8181818 0.4000000 Fold056
0.5833333 0.9166667 0.4000000 Fold057
0.6969697 0.8181818 0.3333333 Fold058
0.8333333 0.9166667 0.6000000 Fold059
0.6515152 1.0000000 0.3333333 Fold060
0.8000000 1.0000000 0.6000000 Fold061
0.6805556 0.6666667 0.5000000 Fold062
0.4500000 0.8333333 0.2000000 Fold063
0.5272727 1.0000000 0.0000000 Fold064
0.7500000 1.0000000 0.1666667 Fold065
0.5277778 0.8333333 0.5000000 Fold066
0.5909091 0.9090909 0.3333333 Fold067
0.7916667 0.8333333 0.5000000 Fold068
0.7500000 1.0000000 0.1666667 Fold069
0.4000000 0.8333333 0.4000000 Fold070
0.5000000 0.7500000 0.1666667 Fold071
0.6166667 0.8333333 0.4000000 Fold072
0.7424242 0.9090909 0.3333333 Fold073
0.6666667 0.7272727 0.5000000 Fold074
0.6111111 0.7500000 0.0000000 Fold075
0.2500000 0.6666667 0.0000000 Fold076
0.6500000 0.7500000 0.2000000 Fold077
0.3500000 0.5833333 0.2000000 Fold078
0.7818182 0.9090909 0.4000000 Fold079
0.4833333 0.9166667 0.2000000 Fold080
0.7666667 0.8333333 0.4000000 Fold081
0.9833333 1.0000000 0.4000000 Fold082
0.3888889 0.5833333 0.1666667 Fold083
0.5454545 0.8181818 0.1666667 Fold084
0.4909091 0.8181818 0.0000000 Fold085
0.6166667 0.9166667 0.2000000 Fold086
0.6666667 0.8333333 0.3333333 Fold087
0.8472222 1.0000000 0.6666667 Fold088
0.4722222 0.8333333 0.1666667 Fold089
0.5972222 0.8333333 0.1666667 Fold090
0.6666667 0.9090909 0.3333333 Fold091
0.6944444 0.9166667 0.3333333 Fold092
0.5303030 0.8181818 0.0000000 Fold093
0.5833333 0.8333333 0.4000000 Fold094
0.7454545 0.7272727 0.6000000 Fold095
0.6527778 0.8333333 0.3333333 Fold096
0.6000000 0.8333333 0.2000000 Fold097
0.6111111 0.8333333 0.1666667 Fold098
0.8194444 0.8333333 0.5000000 Fold099
0.5757576 0.7272727 0.1666667 Fold100

5.3 Model Generalization Across Space

This section creates two sets of plots showing how the model's predictions generalize across stations. Overall, the model generalizes well. More incorrect predictions occur around Gent-Sint-Pieters, Brussel-Noord/Bruxelles-Nord, Brussel-Centraal/Bruxelles-Central, Brussel-Zuid, Leuven, and Antwerpen-Centraal. These are comparatively large stations in big cities, so more factors might affect prediction accuracy there.

AA <- data_f_reg %>%
   mutate(Probs = predict(m1, newdata = data_f_reg, type = "response"),
          Pred  = as.factor(ifelse(Probs > 0.5, 1, 0))) %>%
  dplyr::select(rowIndex, high_occ, high_occ_num, Pred)


new_data <- merge(x = AA, y = data_f, by.x = "rowIndex") %>%
  mutate(CM = case_when(
    Pred == 0 & high_occ_num == 0 ~ 'TN',
    Pred == 0 & high_occ_num == 1 ~ 'FN',
    Pred == 1 & high_occ_num == 0 ~ 'FP',
    Pred == 1 & high_occ_num == 1 ~ 'TP'),
    Error = ifelse(Pred == high_occ_num, "No Error", "Error"))  # a match is a correct prediction

# table(new_data$CM)
# View(new_data)

conf.matrix.station <- new_data %>%
  group_by(name, CM) %>%
  summarize(count = n())

conf.matrix.station.sf <- merge(x = conf.matrix.station, y = stations, by = "name" , all.x = TRUE)  %>%
  st_sf()

error.station <- new_data %>%
  group_by(name,Error) %>% 
  summarise(count = n())
  
# View(error.station)

error.station.sf <- merge(x = error.station, y = stations, by = "name" , all.x = TRUE)  %>%
  st_sf()
ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = conf.matrix.station.sf %>%
            filter(country.code == 'be'), 
          aes(color = count), size = 1) +
  labs(title="Confusion Matrix Across Stations",
       subtitle = ' ',
       caption = 'Figure 5.3.1')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+
  facet_wrap(~CM) + 
  mapTheme()+
  plotTheme() 

ggplot() +
  geom_sf(data = rail, color = "grey") +  
  geom_sf(data = error.station.sf %>%
            filter(country.code == 'be'), 
          aes(color = count), size = 1) +
  labs(title="Prediction Errors Across Stations",
       subtitle = ' ',
       caption = 'Figure 5.3.2')+ 
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+
  facet_wrap(~Error) + 
  mapTheme()+
  plotTheme() 

5.4 ROC Curve for the Model

Theoretically, the ROC curve plots the true positive rate against the false positive rate at every decision threshold, while the AUC provides an aggregate measure of performance across all possible classification thresholds. As the graph below shows, each additional improvement in the true positive rate costs a growing proportion of false positive errors: moving from a 75% to a 100% true positive rate dramatically increases the false positive rate. This diminishing return is the trade-off at work.

The AUC of 0.7997 means that, given a randomly chosen positive example and a randomly chosen negative example, the model ranks the positive one higher about 80% of the time.
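This ranking interpretation can be illustrated with a toy example (the scores below are made up, not the project's data): the AUC equals the share of positive-negative pairs in which the positive case receives the higher score.

```r
# AUC as a ranking probability: the share of (positive, negative) pairs
# where the positive example scores higher (ties count half). Toy scores.
pos <- c(0.9, 0.7, 0.6, 0.4)  # predicted probabilities, actual positives
neg <- c(0.8, 0.5, 0.3, 0.2)  # predicted probabilities, actual negatives
pairs <- expand.grid(p = pos, n = neg)
auc <- mean(pairs$p > pairs$n) + 0.5 * mean(pairs$p == pairs$n)
auc
# 0.75
```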

testProbs1$Outcome1 <- as.numeric(testProbs1$Outcome1)
# auc(testProbs1$Outcome1, testProbs1$Probs1)

ggplot(testProbs1, aes(d = as.numeric(testProbs1$Outcome1), m = Probs1)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#3e7ec2") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = '#a6dbbc') +
  labs(title = "ROC Curve - Model 2\n",
       caption = 'Figure') +
  plotTheme()

5.5 Thresholds Optimization

This section searches for an optimal threshold by plotting the confusion matrix outcomes at each threshold. In our run, the graphs below show essentially no change in the outcome counts as the threshold increases, which suggests the default threshold does not need to be modified.

iterateThresholds <- function(data) {
  x = 0.01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      testProbs1 %>%
      mutate(predOutcome_new = ifelse(Probs1 > x, 1, 0)) %>%   # use x, not a fixed 0.5
      count(predOutcome_new, Outcome1) %>%
      # Outcome1 is numeric here (1 = low/medium, 2 = high) after the earlier as.numeric()
      summarize(True_Negative = sum(n[predOutcome_new==0 & Outcome1==1]),
                True_Positive = sum(n[predOutcome_new==1 & Outcome1==2]),
                False_Negative = sum(n[predOutcome_new==0 & Outcome1==2]),
                False_Positive = sum(n[predOutcome_new==1 & Outcome1==1])) %>%
      gather(Variable, Count) %>%
      mutate(Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}


whichThreshold_subsidy <- iterateThresholds(testProbs1)
# View(whichThreshold_subsidy)

whichThreshold_subsidy  %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette4) +    
  labs(title = "Counts by confusion matrix type and threshold\n",
       y = "Count",
       caption = 'Figure 5.4') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

whichThreshold_subsidy_new <- 
  whichThreshold_subsidy %>%
  group_by(Threshold) %>%
  summarise(Total_Count = sum(Count))

# View(whichThreshold_subsidy_new)

whichThreshold_subsidy_new %>%
  ggplot(aes(Threshold, Total_Count)) +
    geom_point(color =  "#658feb") +
    geom_vline(xintercept = dplyr::pull(arrange(whichThreshold_subsidy_new, -Total_Count)[1,1])) +
  ylab('Total Counts')  +
  labs(title = "Total Counts by Threshold\n",
       caption = 'Figure 6.3')  + 
 plotTheme() 

VI. Conclusion

6.1 Application

As shown above, a versatile App named BeRail will be developed based on the algorithm in this project. BeRail has two primary panels: a dashboard and a query interface. In the dashboard, transportation planners can see an overview of occupancy levels by time and space through visualization tools such as summary tables, heat maps, bar charts, and plots. The query interface, in turn, works as a filter engine in which planners set specific conditions, such as station, line, date, and time, depending on their policy goals. The filtered result can be visualized in the App or exported in a usable data format for later use.

BeRail Dashboard Interface

BeRail Query Interface

Generally, this App would assist policy decision-making, helping transportation planners monitor train occupancy levels and optimize the allocation of public resources while decreasing operational costs. For instance, to increase ridership and reduce wasted capacity, planners can simply filter for stations with constantly low occupancy; the result pinpoints both the stations and the surrounding communities as candidates for infrastructure improvement.
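As a sketch of that filtering step (the prediction table `preds` and its columns `name` and `Probs` are illustrative stand-ins, not the App's actual schema):

```r
# Hypothetical sketch of the low-occupancy filter described above,
# using dplyr and a toy prediction table; all names are illustrative.
library(dplyr)
preds <- data.frame(
  name  = rep(c("StationA", "StationB", "StationC"), each = 4),
  Probs = c(0.10, 0.20, 0.15, 0.10,   # StationA: constantly low
            0.70, 0.80, 0.60, 0.90,   # StationB: constantly high
            0.20, 0.90, 0.30, 0.70))  # StationC: mixed
low_occ_stations <- preds %>%
  group_by(name) %>%
  summarize(max_prob = max(Probs)) %>%
  filter(max_prob < 0.5) %>%           # never predicted high occupancy
  pull(name)
low_occ_stations
# "StationA"
```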

In the current situation, the App's benefits go beyond our original expectations: it can help distribute travelers more evenly across trains and help passengers avoid boarding an overly crowded train, reducing COVID-19 risks.

6.2 Reflection

This project has shown some limitations, some of which are inherent in the algorithm:

  • First of all, for the sake of simplicity, this project collapsed the three-level occupancy measure into two levels. This choice, to some extent, obscures the subtle differences between train lines with low and medium occupancy levels and hampers prediction accuracy. For a better model, further effort should go into an ordinal logistic regression model, which handles a categorical dependent variable with three or more ordered levels.

  • Secondly, instead of analyzing occupancy levels for each origin-destination (OD) pair, this project focuses on the occupancy level of each departure and destination station. This choice could be questioned, since occupancy is meant to describe the overall passenger density along a train line. However, because the set of departure stations differs from the set of destination stations, modeling every OD combination would generate large numbers of null values across features, and models built on such data would barely capture the train line characteristics. This project therefore assumes that the occupancy level is consistent across the stations along a given line, which lets us analyze the potential occupancy level of each station.

  • Thirdly, an innate limitation is that this project is built on pre-existing data collected four years ago and covering only four months. Poor data accessibility limits the accuracy and generalization of the algorithm. Moreover, without real-time data or detailed passenger characteristics, such as trip purposes or special-service needs, the current algorithm may miss details, resulting in lower accuracy.
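On the first point, a minimal sketch of the suggested ordinal alternative using `MASS::polr`, with synthetic data standing in for the project's three-level occupancy variable (the columns, coefficients, and cutoffs below are invented for illustration):

```r
# Ordinal logistic regression sketch with MASS::polr; synthetic data
# stands in for the three-level (low < medium < high) occupancy outcome.
library(MASS)
set.seed(1)
toy <- data.frame(temperature = rnorm(300, 10, 5),
                  week = sample(1:15, 300, replace = TRUE))
lin <- 0.3 * toy$temperature + 0.1 * toy$week + rnorm(300)
toy$occ <- cut(lin, breaks = quantile(lin, c(0, 0.4, 0.8, 1)),
               labels = c("low", "medium", "high"),
               include.lowest = TRUE, ordered_result = TRUE)
m_ord <- polr(occ ~ temperature + week, data = toy, Hess = TRUE)
probs <- predict(m_ord, type = "probs")  # one probability column per level
head(round(probs, 3))
```

Unlike the binary collapse, this keeps the low/medium distinction and returns a probability for each of the three ordered levels.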

Despite these limitations, we still believe this project is beneficial to city planners and train conductors. Past patterns can always shed light on present problems and help produce an annual report on train operations and further improvements.